Introduction to Open Data Science - Course Project

About the project

This is an introductory course on open data science which aims to teach students modern approaches to reproducible research. The main tool in this course is the R programming language, a de-facto standard in statistical research and data-processing applications. Moreover, modern best practices in research suggest sharing not only the final result of a study but also making it easy to reproduce. For studies performed mainly with the R language this can be achieved by using the R Markdown format for reporting. An R Markdown document combines the R source code, explanations and comments in formatted text, and the output of the R code. One of the goals of this course is to show students how to make the best use of these powerful instruments and to encourage them to apply these techniques in their own research.

My personal expectations for this course are quite high. I hope to gain some fluency in R and to apply some of the covered approaches in my research.

The project repository: https://github.com/joewkr/IODS-project


Regression and model validation

A short summary and things learned

First of all, we load the data set we are going to work with. This is a preprocessed data set with information from a survey taken by students of a statistics course; it also includes the grades received after the course.

learning2014 <- read.csv('learning2014.csv')
dim(learning2014)
## [1] 166   7

This data set provides information about 166 students across 7 variables. These include the student's age and gender, general attitude towards statistics, their replies to survey questions combined into four categories, and the total number of points obtained in the final exam of the course. The structure of the data set and a sample subset can be obtained with the following command:

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

For a quick general overview of the data set we can use the pairs function to generate a simple scatter-plot matrix comparing all the variables against each other.

pairs(learning2014)

The points column is of primary interest for us because it holds the students' exam grades. As can be seen from the figure, there is no easy-to-spot dependency between the exam points and the other variables except attitude. The age column seems to have some interesting features as well, though less pronounced. The majority of students were in their twenties, as will be evident from the estimated distribution functions introduced further down in the text, and this big group of students received all possible grades from very low to high. But if we focus on the older students we can see that for them there seems to be a negative correlation between age and points received in the exam.

To get a clearer overview of the dependencies between different variables it is often useful to complement scatter plots with a numerical measure, for example the correlation coefficient. Here we calculate correlations between all the numeric variables in the data set:

numeric_columns <- tail(names(learning2014)) # the last six columns, i.e. all but 'gender'
cor(learning2014[numeric_columns])
##                  age    attitude        deep        stra       surf      points
## age       1.00000000  0.02220071  0.02507804  0.10244409 -0.1414052 -0.09319032
## attitude  0.02220071  1.00000000  0.11024302  0.06174177 -0.1755422  0.43652453
## deep      0.02507804  0.11024302  1.00000000  0.09650255 -0.3238020 -0.01014849
## stra      0.10244409  0.06174177  0.09650255  1.00000000 -0.1609729  0.14612247
## surf     -0.14140516 -0.17554218 -0.32380198 -0.16097287  1.0000000 -0.14435642
## points   -0.09319032  0.43652453 -0.01014849  0.14612247 -0.1443564  1.00000000

The correlation matrix shows that students who did not plan to get too deep into the topic of the course (i.e. students with a high surf score) tend to receive lower grades in the exam. Surprisingly, the students who indicated in their surveys that they strive for a deeper understanding of the topic (a high score in the deep group) did not show better (or worse) exam grades than other student groups. Apart from the students with a high general attitude towards statistics, only the students who tend to use a systematic and organized approach to studying (a high stra score) were able to get better scores in the final exam.
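
A correlation coefficient alone does not say whether it differs significantly from zero. Base R's cor.test provides such a test; the sketch below uses synthetic stand-in data (with the real data one would call, e.g., cor.test(learning2014$attitude, learning2014$points)):

```r
# Sketch: testing whether a correlation is significantly different from zero.
# Synthetic stand-in data; with the real data set one would test, e.g.,
# cor.test(learning2014$attitude, learning2014$points).
set.seed(42)
x <- rnorm(166)            # stands in for 'attitude'
y <- 0.4 * x + rnorm(166)  # stands in for 'points', correlated by construction
ct <- cor.test(x, y)
ct$estimate  # the sample correlation coefficient
ct$p.value   # probability of such a correlation under the null of no correlation
```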

Further insight into the data set can be gained from the estimated distributions of the numerical variables:

i <- 1
for(c in tail(names(learning2014))) { # loop over the six numeric columns
  p <- ggally_densityDiag(learning2014, mapping = aes_string(x=c)) + 
       theme_bw() +  
       theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
  
  # store each plot as p1...p6 so grid.arrange can collect them below
  var_name <- paste('p', i, sep='')
  assign(var_name, p)
  i <- i + 1
}

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 6)

There it is clear that the majority of students who took the survey are around 20 years old and there are not many older students in the group. The distributions of the surf and deep scores show somewhat opposite pictures, with surf tending towards lower values and deep towards higher ones. I guess to some extent this could be caused by the human factor, since the questions in the deep group tend to describe a good student while the questions in the surf group describe a somewhat lazy person. Another interesting observation is that getting 13…15 points in the exam is less probable than getting nearby values, which could be caused by some peculiarities of the grading system.

Finally, we can produce an overall summary of the data set, which supports the earlier observations:

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

So, which variables could be used to build a linear model for predicting the exam grade? Obviously, attitude should be included since it shows the highest correlation with the exam points. Both the deep and surf groups seem to be influenced by the human factor, making them less suitable for this modeling exercise, unlike the stra group (which also has the second-highest correlation with the exam points). Therefore we build the model with attitude, stra and age, the last one because it shows some interesting features for the older students.

model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The summary of the fitted model provides a basic overview and gives important information about the statistical significance of the assumed dependency between the target and explanatory variables. For each explanatory variable the significance test provides the probability of getting the observed values of the target variable if the explanatory variable in question had random values. As a consequence, a high p-value makes it less likely that the target variable depends on that explanatory variable. For our model, only the attitude parameter shows a very low p-value of 4.72e-09, which indicates a strong and significant dependency between the exam points and the student's attitude. The other two explanatory variables, stra and age, with p-values of 0.0621 and 0.0981, though still significant assuming a significance level of \(\alpha\) = 0.10, show a much weaker connection with the target variable.
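
The p-values quoted above can also be extracted programmatically from the coefficient table of the model summary. A minimal sketch using the built-in cars data as a stand-in (the same calls work on the model object fitted above):

```r
# Sketch: pulling the coefficient table out of a fitted lm object.
# The built-in 'cars' data is used as a stand-in for learning2014.
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Pr(>|t|)"]          # the p-value of each term
```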

The obtained linear model can be written as:

\[ \text{points} = 10.9 + 3.481\cdot\text{attitude} + 1.004\cdot\text{stra} - 0.08822\cdot\text{age} \] and shows that students with higher attitude and stra scores tend to get a higher final exam grade. For the older students, however, the negative age coefficient indicates that they tend to have a lower final exam grade than younger students with the same attitude and stra scores.
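
The fitted equation can be evaluated by hand for a hypothetical student; the profile below (attitude = 4, stra = 3, age = 25) is made up for illustration, and predict(model, newdata = ...) would give essentially the same number:

```r
# Sketch: evaluating the fitted regression by hand for a hypothetical student
# (attitude = 4, stra = 3, age = 25 are made-up illustration values).
b <- c(intercept = 10.89543, attitude = 3.48077, stra = 1.00371, age = -0.08822)
points_hat <- b[["intercept"]] + b[["attitude"]] * 4 + b[["stra"]] * 3 + b[["age"]] * 25
points_hat  # about 25.6 points
```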

Another important parameter provided in the summary of our model is the multiple R-squared value, which shows how good the fit of our model is. The low value of the multiple R-squared parameter indicates that only about 20% of the variation in the exam grades can be explained by the selected set of explanatory variables. However, this parameter does not indicate whether the selected set of explanatory variables is optimal or not.
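
The multiple R-squared is simply one minus the ratio of the residual sum of squares to the total sum of squares, which is easy to verify on any lm fit (the built-in cars data serves as a stand-in here):

```r
# Sketch: multiple R-squared equals 1 - RSS/TSS for a model with an intercept.
fit <- lm(dist ~ speed, data = cars)         # built-in data as a stand-in
rss <- sum(residuals(fit)^2)                 # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)  # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)  # TRUE
```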

To estimate how well the assumptions made when fitting the linear model hold, we use the following series of diagnostic plots:

par(mfrow=c(3,1))
plot(model, which=c(1,2,5))

There the ‘Residuals vs Fitted’ plot shows the dependency between the fitted model values and the residuals, calculated as the difference between the observed and fitted values. A sound model should show no clear dependency between the residuals and the fitted values. The produced model seems to have reasonable residuals which are more or less evenly distributed across the range of fitted values, although a worrying point is the cluster of numbered points (35, 56 and 145) with somewhat larger residual values.

The same pattern can be observed in the ‘Normal Q-Q’ plot as well. This plot indicates how well the assumption of normally distributed residuals holds (ideally the points should lie on the 1:1 line). Again, on a seemingly reasonable plot the cluster of numbered points could indicate potential issues with the model.

Finally, the ‘Residuals vs Leverage’ plot shows how much weight is given to individual observations when fitting the model. For a well-defined model this plot should not show outlier points with high leverage values. Our model seems to fit reasonably well under this assumption; although it still shows some numbered points, these have relatively low leverage values, making them a bit less of an issue.


Logistic regression

A short summary and things learned

Again, before we start the analysis we load the data set from the file system. This data set provides information about students of two Portuguese schools and their achievements. It contains the received grades (variables G1, G2 and G3) as well as the students' background, including social and school-related information. The original data set, together with a detailed description of its variables, can be obtained via the following link: https://archive.ics.uci.edu/ml/datasets/Student+Performance. Although the original data was provided separately for two subjects, Mathematics and Portuguese language, in the current study a preprocessed data set obtained by merging the two original data sets is used.

The variables in the preprocessed data set can be shown by applying the names function to the loaded data frame:

alc <- read.csv('alc.csv')
dim(alc)
## [1] 382  35
names(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

As can be seen from the output, this data set introduces two additional variables, namely alc_use, which is a simple mean of Dalc and Walc, and high_use, which indicates relatively high alcohol consumption by a student and is defined to be TRUE when alc_use exceeds the threshold value of 2.
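
A base-R sketch of how these two derived columns could be computed from Dalc and Walc (the actual preprocessing presumably did the same with dplyr's mutate); the toy data frame below only reuses the column names:

```r
# Sketch: deriving alc_use and high_use from the weekday (Dalc) and
# weekend (Walc) consumption scores; toy data with the same column names.
d <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 4, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2  # simple mean of the two scores
d$high_use <- d$alc_use > 2          # TRUE above the threshold of 2
d$high_use  # FALSE TRUE TRUE
```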

Since the data set provides quite extensive information about the students, it should be possible to build a statistical model predicting high alcohol consumption from these data. For this study I chose the following four variables:

selected_columns <- c('famrel', 'studytime', 'absences', 'goout')

where famrel describes the quality of the student's family relationships (with higher values corresponding to better relationships), studytime describes how much time a student spends on their studies per week, absences gives the number of times a student was absent from school, and goout describes how often a student goes out with friends.

I expect these variables to be connected to alcohol consumption, since it would be natural to assume that a person with bad family relationships, who does not spend much time studying, is often absent from school, and goes out with friends a lot would show higher alcohol consumption.

To get a better overview of the selected variables and to study their possible connection to alcohol consumption, a graphical summary can be of great help.

alc %>% 
pivot_longer(all_of(selected_columns), names_to='key', values_to='value') %>% 
ggplot(aes(value, fill=high_use)) + 
  facet_wrap("key", scales = "free") + 
  geom_bar() + 
  theme_bw() +  
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

The bar plots of the selected variables show no striking features supporting the initial hypothesis, although goout tends to have higher values for students with high alcohol consumption. Surprisingly, absences does not show peaks of high absence for students with high alcohol consumption but rather features a gradual increase in the fraction of drinkers with an increasing number of days absent. For the studytime variable, the fraction of students with high alcohol consumption greatly diminishes with an increasing number of hours spent on studies. Finally, famrel shows a somewhat similar picture, with relatively fewer drinking students in families with good relationships.

This information can also be presented in tabular form, which supports the conclusions drawn from the graphs:

for(var in selected_columns) {
  print(table(alc[c('high_use', var)]))
}
##         famrel
## high_use   1   2   3   4   5
##    FALSE   6  10  39 135  78
##    TRUE    2   9  25  54  24
##         studytime
## high_use   1   2   3   4
##    FALSE  58 135  52  23
##    TRUE   42  60   8   4
##         absences
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 26 27
##    FALSE 52 38 42 33 24 16 16  9 14  6  5  2  4  1  1  0  0  1  0  2  1  0  0
##    TRUE  13 13 16  8 12  6  5  3  6  6  2  4  4  1  6  1  1  1  1  0  1  1  1
##         absences
## high_use 29 44 45
##    FALSE  0  0  1
##    TRUE   1  1  0
##         goout
## high_use   1   2   3   4   5
##    FALSE  19  84 103  41  21
##    TRUE    3  16  23  40  32

To predict high alcohol consumption based on the selected variables a logistic regression model can be applied:

f_string <- paste('high_use ~ ', paste(selected_columns, collapse = ' + '))
model <- glm(as.formula(f_string), data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = as.formula(f_string), family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8701  -0.7738  -0.5019   0.8042   2.5416  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.28606    0.70957  -1.812  0.06992 .  
## famrel      -0.33699    0.13681  -2.463  0.01377 *  
## studytime   -0.55089    0.16789  -3.281  0.00103 ** 
## absences     0.06753    0.02175   3.104  0.00191 ** 
## goout        0.75953    0.12041   6.308 2.82e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 384.07  on 377  degrees of freedom
## AIC: 394.07
## 
## Number of Fisher Scoring iterations: 4

The summary of the trained model indicates that all the selected predictors have a statistically significant relation with the target variable high_use at the significance level \(\alpha = 0.05\). The coefficients of the model suggest that absences has the weakest connection with the target variable, whereas the other selected explanatory variables show a stronger link.

A bit more insight into the characteristics of the obtained model can be gained by studying the coefficients in the form of odds ratios and providing their confidence intervals:

odds_ratio <- model %>% coef    %>% exp
confidence <- model %>% confint %>% exp
## Waiting for profiling to be done...
cbind(odds_ratio, confidence)
##             odds_ratio      2.5 %    97.5 %
## (Intercept)  0.2763573 0.06723596 1.0961732
## famrel       0.7139151 0.54460646 0.9331198
## studytime    0.5764343 0.41040872 0.7941804
## absences     1.0698631 1.02591583 1.1187950
## goout        2.1372720 1.69853389 2.7261579

This table shows that for famrel and studytime the odds ratios are below one (namely 0.714 and 0.576), which means that increasing these parameters would decrease the probability of a student being a drinker. On the contrary, the odds ratios for absences and goout are above one, so increasing these parameters would increase the probability of high alcohol consumption. Another important observation from this table is that the confidence intervals of all the odds ratios do not include 1, meaning that these odds ratios are significant. This is true also for the absences variable, whose odds ratio of 1.07 is rather close to 1 but still significant.
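
To make the odds ratios more tangible, they can be translated to the probability scale for an assumed baseline. The baseline probability of 0.3 below is purely hypothetical:

```r
# Sketch: what an odds ratio of ~2.14 for 'goout' means on the probability
# scale, assuming a hypothetical baseline probability of 0.3.
or    <- 2.1372720
p0    <- 0.3
odds0 <- p0 / (1 - p0)        # baseline odds
odds1 <- odds0 * or           # odds after a one-unit increase in goout
p1    <- odds1 / (1 + odds1)  # back to a probability
round(p1, 3)  # about 0.478
```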

To investigate the predictive power of our model, it can be used to estimate the probability of high alcohol consumption from the same data set that was used to train the model. To do so, the predict function is applied and, if the resulting probability is higher than 0.5, a student is assumed to be a drinker. This information combined with the actual data on alcohol consumption allows generating a cross tabulation of predicted vs actual values:

probabilities <- predict(model, type = "response")
alc <- alc %>% mutate(probability = probabilities, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   242   26
##    TRUE     65   49

This table shows an interesting property of the model: it predicts low alcohol consumption better than high consumption, i.e. it gives a considerably high number of false negatives. This property of the model is apparent from the following plot:

alc %>%
ggplot(aes(x = probability, y = high_use)) +
  geom_violin(trim = TRUE, fill='#DCDCDC') +
  geom_vline(xintercept = 0.5) +
  xlim(0,1) +
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

It is evident that for students with low alcohol consumption the model consistently predicts a low probability of being a drinker. On the other hand, for drinking students the model yields a close-to-uniform distribution of predictions, showing no predictive skill.

The total proportion of failed predictions (i.e. the training error) can be calculated with the following loss function, which averages an indicator of the prediction being wrong:

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199

which gives a value of 0.238 meaning that about a quarter of the predictions were incorrect.

Such an approach, which uses the whole data set to train a model and then to test it, though simple and straightforward, gives a biased estimate of model quality, because all the data points used for testing are already ‘known’ to the model since they were used for training. A more robust approach takes only a part of the data set for training the model and keeps the rest for testing. This approach is implemented in the cv.glm function, which estimates the model prediction error in the presence of unknown observations. There the data set is divided into 10 chunks, 9 of which are used for model training while the remaining one is used for validation. This procedure is repeated for all combinations of training and validation chunks, returning the K-fold cross-validation prediction error.
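
Roughly, the procedure can be spelled out by hand; the sketch below does so on the built-in mtcars data with a toy formula and 5 folds (cv.glm follows the same idea with K = 10 on the real data):

```r
# Sketch of K-fold cross-validation done manually on stand-in data
# (mtcars, toy formula am ~ mpg); cv.glm follows the same idea.
set.seed(1)
d <- mtcars
K <- 5
fold <- sample(rep(1:K, length.out = nrow(d)))  # assign each row to a fold
errs <- sapply(1:K, function(k) {
  train <- d[fold != k, ]                       # K-1 folds for training
  test  <- d[fold == k, ]                       # held-out fold for validation
  m <- glm(am ~ mpg, data = train, family = "binomial")
  p <- predict(m, newdata = test, type = "response")
  mean((p > 0.5) != test$am)                    # misclassification on held-out fold
})
mean(errs)  # the cross-validated prediction error
```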

cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
## [1] 0.2434555

For the model built in the current exercise the cross-validation procedure gives a prediction error of 0.243, which is higher than the previous estimate because it uses data unknown to the model for validation. This error is lower than the error of the example model given for this course (which uses the formula high_use ~ sex + failures + absences), which was about 0.26.

But given a relatively large number of potential explanatory variables, it is not a straightforward task to select the best set for building a model. A naïve solution would be to test all possible combinations of the explanatory variables and take the model which gives the lowest prediction error. The problem is that for a data set with 34 free parameters the number of possible combinations of explanatory variables would be:

\[ C = \sum_{r=1}^{34} \frac{34!}{r!\cdot(34 - r)!} = 2^{34} - 1 \approx 1.7\times 10^{10} \] which renders this idea rather impractical. But to assess the properties of models built with different explanatory variables, using only a minor fraction of all possible combinations should suffice. Therefore, to investigate models built with various combinations and numbers of explanatory variables, the following approach is applied. For each size of the explanatory vector, a random selection of the explanatory variables is used to train a model and obtain the training and prediction errors. This procedure is repeated N times for each explanatory vector size to get multiple model realizations. For each model, the model formula, the size of the explanatory vector, and the prediction and training errors are stored in a data frame for the following analysis.
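
This count can be checked directly in R, since summing the binomial coefficients over all non-empty subset sizes gives \(2^{34} - 1\):

```r
# Sketch: the number of non-empty subsets of 34 candidate predictors.
n_combinations <- sum(choose(34, 1:34))
n_combinations               # 17179869183, i.e. about 1.7e10
n_combinations == 2^34 - 1   # TRUE
```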

possible_predictors <- head(names(alc), n = -3)
possible_predictors <- possible_predictors[! possible_predictors %in% c('Walc', 'Dalc', 'alc_use')]

mdf <- data.frame(numeric(0), logical(0), numeric(0), character(0), stringsAsFactors = FALSE)
names(mdf) <- c('size', 'training', 'error', 'model')

num_possible_predictors <- length(possible_predictors)
for(i in 30:1) {
  # Do not test all the combinations for models with more than
  # three predictors, it takes way too much time.
  if(i <= 3) {
    selected_predictors <- gtools::combinations(
        n=num_possible_predictors
      , r=i
      , v=possible_predictors
      , repeats.allowed=FALSE)
  } else {
    selected_predictors <- t(replicate(400, sample(possible_predictors, size = i)))
  }

  for(r in 1:nrow(selected_predictors)) {
    predictors <- selected_predictors[r,]
    
    f_string <- paste('high_use ~ ', paste(predictors, collapse = '+'))
    m <- tryCatch(
      glm(as.formula(f_string), data = alc, family = "binomial"), 
      error=function(e){return(NULL)}, 
      warning=function(w) {return(NULL)})
    
    if(!is.null(m)) {
      # Calculate training error
      alc_loc <- data.frame(alc)
      p <- predict(m, type = "response")
      alc_loc <- alc_loc %>% mutate(probability = p, prediction = probability > 0.5)
      loss <- loss_func(class = alc_loc$high_use, prob = alc_loc$probability)
      
      cv <- cv.glm(data = alc_loc, cost = loss_func, glmfit = m, K = 10)

      # Store training and testing errors
      mdf[nrow(mdf) + 1,] <- list(i, TRUE,  loss,        f_string)
      mdf[nrow(mdf) + 1,] <- list(i, FALSE, cv$delta[1], f_string)
    }
  }
}

As can be seen from the summary figure, the model errors tend to become smaller as the size of the explanatory vector increases (there training indicates training errors if TRUE and prediction errors if FALSE). This could be attributed to the non-exhaustive nature of the selected approach, which makes it easier to draw a set of ‘bad’ variables if the explanatory vector is short. On the other hand, for long explanatory vectors there is a higher probability of getting ‘good’ variables from a random draw.

mdf$size <- as.factor(mdf$size)
mdf %>% 
ggplot(aes(x = size, y = error, col=training)) + 
  geom_boxplot() + 
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

From all the tested models, the following gave the lowest prediction error:

mdf[mdf$training == FALSE,] %>% slice_min(error)
##   size training     error
## 1   13    FALSE 0.2041885
## 2    9    FALSE 0.2041885
##                                                                                                   model
## 1 high_use ~  activities+Mjob+internet+studytime+absences+paid+sex+reason+nursery+Fedu+age+goout+higher
## 2                               high_use ~  reason+G2+activities+famsup+absences+sex+goout+Medu+nursery

The model with the lowest prediction error (the one with fewer explanatory variables was taken) shows some improvement in prediction skill, though it still tends to give a considerable number of false negative predictions according to the provided summary:

best_score <- mdf[mdf$training == FALSE,] %>% slice_min(error)
best_formula <- as.formula(best_score$model[2])
better_model <- glm(as.formula(best_formula), data = alc, family = "binomial")
summary(better_model)
## 
## Call:
## glm(formula = as.formula(best_formula), family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0255  -0.7642  -0.4861   0.8064   2.7573  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.51951    0.82443  -4.269 1.96e-05 ***
## reasonhome        0.18904    0.32007   0.591 0.554773    
## reasonother       1.09637    0.45591   2.405 0.016182 *  
## reasonreputation -0.20941    0.35264  -0.594 0.552630    
## G2               -0.02247    0.04843  -0.464 0.642723    
## activitiesyes    -0.47083    0.26752  -1.760 0.078409 .  
## famsupyes         0.04418    0.27130   0.163 0.870655    
## absences          0.08698    0.02364   3.679 0.000234 ***
## sexM              1.03519    0.27176   3.809 0.000139 ***
## goout             0.76352    0.12690   6.017 1.78e-09 ***
## Medu             -0.03218    0.12743  -0.253 0.800646    
## nurseryyes       -0.45547    0.32090  -1.419 0.155799    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 373.58  on 370  degrees of freedom
## AIC: 397.58
## 
## Number of Fisher Scoring iterations: 4
p2 <- predict(better_model, type = "response")
a2 <- data.frame(alc)
a2 <- a2 %>% mutate(probability = p2, prediction = probability > 0.5)

table(high_use = a2$high_use, prediction = a2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251   17
##    TRUE     57   57
a2 %>%
ggplot(aes(x = probability, y = high_use)) +
  geom_violin(trim = TRUE, fill='#DCDCDC') +
  geom_vline(xintercept = 0.5) +
  xlim(0,1) +
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())


Clustering and classification

A short summary and things learned

In this exercise we use one of the standard data sets provided with the R language in the MASS package:

data("Boston")

The data set we use is called Boston and it provides information about housing values in the suburbs of Boston, as well as quite an extensive set of additional information:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In total, the Boston data set has 506 observations of 14 individual variables.

Given such a considerable number of different variables, it is interesting to investigate their distributions and the relationships between them. First, the estimated distributions of all the numeric variables are computed:

boston_labels <- c(
    'crime rate'
  , 'fraction of lots > 25000 sq.ft.'
  , 'fraction of non-retail business land'
  , 'NOx concentration'
  , 'number of rooms'
  , 'fraction of buildings from before 1940'
  , 'distance to Boston employment centres'
  , 'radial highway accessibility index'
  , 'tax per $10000'
  , 'pupil-teacher ratio'
  , 'black/white prevalence'
  , 'lower status percent'
  , 'house median value'
)

i <- 1
for(c in names(Boston)) {
  if(c == 'chas') next
  p <- ggally_densityDiag(Boston, mapping = aes_string(x=c)) +
       xlab(boston_labels[i]) +
       theme_bw() +  
       theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
  
  var_name <- paste('p', i, sep='')
  assign(var_name, p)
  i <- i + 1
}

grid.arrange(
    p1, p2, p3, p4,   p5,  p6
  , p7, p8, p9, p10, p11, p12, p13, nrow = 7, ncol=2)

These distributions provide some insight into the investigated data set. For example, the crime rate is generally low, and most suburbs have either a prevalently black or a prevalently non-black population; towns with mixed population seem to be less common. A considerable number of buildings in the Boston suburbs were built before 1940, there are not many big houses with a lot of rooms, and house values do not vary too much from town to town. Another point to notice is that several variables in the Boston data set show a bimodal distribution. These variables are indus (fraction of non-retail business land), rad (radial highway accessibility index), and tax (tax per $10000), and they indicate some spatial irregularity in the Boston area. For example, it seems that there are locations with either good or bad connections to radial highways and not many locations with intermediate values of the accessibility index.
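
The bimodality of rad, for instance, can also be seen directly from its value counts:

```r
library(MASS)  # ships with R and provides the Boston data set
data(Boston)
# Counts of the accessibility index: most towns have low values, with a
# second large group at the maximum of 24 and few towns in between.
table(Boston$rad)
```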

To get an overview of the relations between different variables, correlation coefficients could be studied. With a considerable number of individual variables, presenting the correlations in graphical form gives a clearer picture.

cor_matrix <- cor(Boston) 
cor_matrix <- cor_matrix %>% round(2)
corrplot(cor_matrix, method="circle", tl.cex=0.6)

This plot shows that a number of variables in the Boston data set have relatively strong correlations with each other. Some of these correlations match my expectations: for example, the median price of buildings shows a negative correlation with the crime rate, and areas with a higher fraction of non-retail business also have higher NOx levels. On the other hand, some of the correlations are rather surprising to me; for example, the number of rooms in a house seems to be negatively correlated with the property tax, though this correlation is weak.

The summary of the Boston data set provided by the summary function gives some additional details on the values of each individual variable, supporting the conclusions from the figures.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

One thing to note here is that all the variables have non-zero mean values.

Before further processing, the data set should be standardized, because the original Boston data have variables with different units and scales, which could affect the subsequent analysis. To perform the standardization, the R function scale is applied:

boston_scaled <- scale(Boston) %>% as.data.frame
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Note that after the standardization procedure the mean values of all the variables in the data set are zero. The procedure also scales the standard deviation of every variable to one.
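This can be verified directly; a minimal check, assuming the MASS package providing the Boston data is loaded as above:

```r
# Verify that scaling centered every column at zero
# and rescaled every column to unit standard deviation
library(MASS)

boston_check <- scale(Boston)
# column means should all be (numerically) zero
round(colMeans(boston_check), 10)
# column standard deviations should all be one
apply(boston_check, 2, sd)
```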

One of the tasks for this week's exercise is to use linear discriminant analysis (LDA) for classifying and predicting the crime rate in Boston suburbs. To do so, the standardized data set should be processed further. First, the crime rate (the crim variable), which is numeric in the original data set, is converted to categorical form. The original variable is split into four categories representing low, medium-low, medium-high, and high crime rates according to the quantiles of crim in the data set:

bins <- quantile(boston_scaled$crim)
labels <- c('low', 'med_low', 'med_high', 'high')
crime <- cut(boston_scaled$crim, breaks = bins, label=labels, include.lowest = TRUE)

boston_scaled <- select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Next, to be able to examine the predictive skill of the model, the full data set is split into two parts: a training set containing 80% of the original data and a test set containing the remaining 20%.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Now an LDA model could be trained with the crime rate as a target variable:

lda.fit <- lda(crime ~ ., data = train)

ggplot.lda.arrows <- function(p, x, myscale = 1, arrow_heads = 0.1, color = "red", choices = c(1,2)){
  heads <- coef(x)
  
  vals <- c(1:length(heads[,1]))
  for(i in vals) {
    p <- p + geom_segment(
        x = 0
      , y = 0
      , xend = myscale * heads[i,choices[1]]
      , yend = myscale * heads[i,choices[2]]
      , color = color
      , arrow = arrow(length = unit(arrow_heads, 'cm')))
  
    p <- p + annotate(
        'text'
      , x=myscale*(heads[i,choices[1]] + 0.1)
      , y=myscale*(heads[i,choices[2]] + 0.1)
      , label = row.names(heads)[i]
      , col=color)
  }
  return(p)
}

lda.data <- data.frame(type = train[,1], lda = predict(lda.fit)$x)
p <- ggplot(lda.data) +
  geom_point(aes(lda.LD1, lda.LD2, color = train$crime), alpha=0.5, size = 2.5) 
p <- ggplot.lda.arrows(p, lda.fit, myscale = 2) +
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

p

The LDA plot shows a well-separated group of points corresponding to the high crime rate. The rad variable seems to be the main contributor there, and because rad corresponds to the accessibility index for the radial highways, this means that areas with high accessibility tend to show higher crime rates. The other crime-rate categories form a somewhat merged group, although the medium-high rate can still be distinguished from the low rate.
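As a quick numerical check of this interpretation, the discriminant coefficients of the fitted model could be ranked by magnitude; a sketch continuing from the lda.fit object above (rad should appear near the top):

```r
# Rank the predictors by the magnitude of their contribution to LD1
ld1 <- coef(lda.fit)[, "LD1"]
sort(abs(ld1), decreasing = TRUE)
```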

The isolated location of the high-crime cluster indicates that the trained model should have considerable predictive skill. To investigate this, the test part of the data set is used to make model predictions, which are then compared with the actual values:

correct_classes <- test$crime
test <- select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      11        1    0
##   med_low    4      19        4    0
##   med_high   0       9       18    2
##   high       0       0        0   22

The cross-tabulation shows that for the high and medium-high crime rates the LDA classifier indeed has considerable predictive skill, with few false negatives. For the medium-low and low crime rates the model shows less skill.
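A single summary number for this table is the overall classification accuracy, i.e. the fraction of correctly predicted categories; a sketch continuing from the objects above (the exact value varies with the random train/test split):

```r
# Fraction of test observations assigned to the correct crime category
mean(correct_classes == lda.pred$class)
```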

So far I have used the available information about the crime rates to build an LDA model and predict the crime rates of the test data set. But such grouping information is not always available. When it is missing, a clustering algorithm could be applied to find potentially distinctive groups. To illustrate such an approach, the full Boston data set is used for finding separate clusters in it.

Again, as a first step the Boston data set is standardized to remove the different units of measure and normalize distances:

data('Boston')
boston_scaled <- Boston %>% scale %>% as.data.frame

In this part the k-means algorithm is applied, which uses distances between individual observations and cluster centers. To show the distances between the individual points of the Boston data set, it could be passed as an argument to the dist function:

dist_eucl <- dist(boston_scaled)
summary(dist_eucl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

It could be seen that the standardized distances are rather short and are not affected by the magnitudes of the values in the original data set.
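The dist function also supports other metrics; for comparison, the Manhattan (city-block) distances could be computed for the same standardized data set — a sketch continuing from the boston_scaled data frame above:

```r
# Manhattan distances between the standardized observations;
# these are larger than the Euclidean ones but follow a similar pattern
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```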

The k-means algorithm requires the number of clusters to be provided by the user. Although an arbitrarily selected number of clusters would often work:

km <- kmeans(boston_scaled, centers = 3)
pairs(Boston, col = km$cluster)

estimating the optimal number of clusters is less trivial. The optimal number could be estimated from the analysis of the total within-cluster sum of squares (WCSS). When the WCSS is plotted against the number of clusters, the optimal number of clusters is indicated by a sharp drop in the WCSS value.

In the following example the k-means algorithm is applied 20 times with different numbers of clusters and the WCSS plot is produced:

k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line') +
  geom_point() +
  geom_line() +
  theme_bw() + 
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

As could be seen from the figure, the sharpest drop in the WCSS value happens at 2 clusters, which indicates that this is the optimal number of clusters.

Using this estimated optimal number of clusters, the k-means algorithm gives the following result:

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

From the summary figure it is not very clear what physical meaning (if any) lies behind the two parts of the data set. Estimated distributions by cluster give some idea of the meaning of the two clusters (here the original data set is used, as it has a clearer physical meaning):

i <- 1
boston_km <- data.frame(Boston, cluster=as.factor(km$cluster))

for(c in names(boston_scaled)) {
  if(c == 'chas') next
  p <- ggally_densityDiag(boston_km, mapping = aes_string(x=c,color='cluster'), alpha=0.5) +
       xlab(boston_labels[i]) +
       theme_bw() +  
       theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
  
  var_name <- paste('p', i, sep='')
  assign(var_name, p)
  i <- i + 1
}

grid.arrange(
    p1, p2, p3, p4,   p5,  p6
  , p7, p8, p9, p10, p11, p12, p13, nrow = 7, ncol=2)

The clusters seem to divide some variables in a logical way: cluster 1 shows higher taxes and older houses, while cluster 2 has a lower highway accessibility index and lower NOx.
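The same interpretation could be checked numerically by comparing the per-cluster means of the original variables; a sketch continuing from the km object fitted above:

```r
# Mean of each original (unscaled) variable within the two k-means clusters;
# differences between the rows hint at what separates the clusters
aggregate(Boston, by = list(cluster = km$cluster), FUN = mean)
```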

Clusters found by the k-means algorithm could also be used as a target variable, as shown in the following example. There the output from k-means with 4 clusters is added to the Boston data set to train an LDA model:

data('Boston')
boston_scaled <- Boston %>% scale %>% as.data.frame
km <- kmeans(boston_scaled, centers = 4)
boston_scaled_km <- data.frame(boston_scaled, cluster=km$cluster)

lda_km.fit <- lda(cluster ~ ., data = boston_scaled_km)

lda_km.data <- data.frame(type = boston_scaled[,1], lda = predict(lda_km.fit)$x)
p <- ggplot(lda_km.data) +
  geom_point(aes(lda.LD1, lda.LD2, color = as.factor(km$cluster)), alpha=0.5, size = 2.5) 
p <- ggplot.lda.arrows(p, lda_km.fit, myscale = 2) +
  theme_bw() +
  theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

p

The summary figure shows that one of the clusters is associated with the chas variable and another with the rad variable, while the other clusters are less pronounced. The rather striking impact of chas could be explained to some extent by the nature of this variable: in the Boston data set chas is a dummy variable taking values of 0 or 1, which breaks the assumptions of LDA.

Another approach to visualizing an LDA model is shown in the following figure:

model_predictors <- select(train, -crime)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(
    x = matrix_product$LD1
  , y = matrix_product$LD2
  , z = matrix_product$LD3
  , type= 'scatter3d'
  , color=train$crime
  , size=2
  , mode='markers')

On this figure each axis corresponds to an individual linear discriminant; unlike the 2D plot, this figure shows more complete information about the fitted LDA.

A similar figure could be obtained by coloring points not by crime rate but rather by clusters found by the k-means algorithm:

km <- kmeans(select(train, -crime), centers = 4)
plot_ly(
    x = matrix_product$LD1
  , y = matrix_product$LD2
  , z = matrix_product$LD3
  , type= 'scatter3d'
  , color=as.factor(km$cluster)
  , size=2
  , mode='markers')

Both figures show somewhat similar features, with a distinctive separate group (which corresponds to high crime rates according to the LDA) almost exclusively occupied by a single cluster. Patterns in the main group of points, although showing some similarity, are less obvious.


Dimensionality reduction techniques

A short summary and things learned

In the first part of this week’s exercise, principal component analysis (PCA) is applied to a data set which provides information about the economic level, life quality, and gender inequality in different countries. As usual, the first step is to load the data set from the file system:

human <- read.csv('human.csv', row.names = 1)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ edu2Ratio   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labRatio    : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ eduE        : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeE       : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ gni         : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mmRatio     : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adlBirthRate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parliamentF : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

This data set includes information about 155 different countries for 8 variables:

- edu2Ratio: ratio of the female and male population with at least secondary education
- labRatio: ratio of the female and male labour force participation rates
- eduE: expected years of education
- lifeE: life expectancy at birth
- gni: gross national income
- mmRatio: maternal mortality ratio
- adlBirthRate: adolescent birth rate
- parliamentF: percentage of female members in parliament

To get a better summary of the analyzed data set, probability density functions could be estimated for each of the variables:

get_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

i <- 1
for(c in names(human)) {
  m <- get_mode(human[[c]])[1]
  if(c == 'labRatio') m_labRatio <- m
  p <- ggally_densityDiag(human, mapping = aes_string(x=c)) +
       geom_vline(xintercept = m) +
       scale_x_continuous(sec.axis = dup_axis(breaks=m, name=NULL)) +
       theme_bw() +  
       theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
  
  var_name <- paste('p', i, sep='')
  assign(var_name, p)
  i <- i + 1
}

grid.arrange(
    p1, p2, p3, p4
  , p5, p6, p7, p8, nrow = 4, ncol=2)

As could be seen from the figures, all the distributions have only one well-pronounced maximum. In most countries the ratio between the female and male population receiving secondary education is a bit less than one, indicating that the female population is almost on a par with the male population. For the labour force participation ratio the situation is somewhat similar, though the maximum of the estimated distribution at 0.82 suggests that the female part of the population is less involved in the labour force. The estimated distributions for the gross national income, maternal mortality, and adolescent birth rate have low maximum probability values and long tails, which could be caused by differences in the development level between countries.

To get a general overview of the studied data set the pairs function could be very useful:

pairs(human)

The figure indicates that some variables are strongly connected to each other, for example eduE and lifeE, but to get a better overview of the relations between the variables a correlation plot is more helpful:

cor_matrix <- cor(human) 
cor_matrix <- cor_matrix %>% round(2)
corrplot(cor_matrix, method='circle', tl.cex=0.6)

The correlation plot shows that the variables in the data set could be split into three groups. The first group includes eduE, lifeE, and gni; the second includes mmRatio and adlBirthRate; and the third group contains the rest of the variables. The first and the second groups show strong negative correlation: in well-developed countries with high life expectancy and expected years of education, the maternal mortality and adolescent birth rates tend to be low, and vice versa. The third group, which includes labRatio and parliamentF, shows only weak connections to the other variables.

As a first step, PCA is applied to the original, non-standardized data set:

pca_human_no_std <- prcomp(human)
summary(pca_human_no_std)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

As could be seen from the summary table, the result of PCA is strongly affected by the uneven scaling of the variables in the original data set. Even though the algorithm reports that the first two principal components capture all the variability of the data set (99.99% for PC1 and 0.01% for PC2), this does not provide much useful information for analysis.
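The reason is easy to see from the raw variances of the variables; a sketch using the human data frame loaded above:

```r
# Variance of each variable in the non-standardized data;
# gni dominates by several orders of magnitude, so the
# unscaled PCA essentially reproduces it as PC1
sort(sapply(human, var), decreasing = TRUE)
```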

This issue could be illustrated by the following plot:

s <- summary(pca_human_no_std)
pca_pr <- round(100*s$importance[2, ], digits = 3)
pc_lab <- paste0(names(pca_pr), ' (', pca_pr, '%)')

biplot(
    pca_human_no_std
  , choices = 1:2
  , cex=c(0.8,1)
  , col=c('grey40', 'deeppink2')
  , xlab = pc_lab[1]
  , ylab = pc_lab[2])
Biplot for PCA applied to the non-standardized data set, with PC1 connected to the gross national income and PC2 showing a weak connection to the maternal mortality and adolescent birth ratios.

While it is clear that PC1 is connected to the gross national income, the contributions from the other variables are so small that they are almost impossible to assess. Still, the figure itself suggests that PC2 is connected to variables such as mmRatio and adlBirthRate.

To make the data set easier to analyze, the standardization procedure should be applied. If PCA is applied to the standardized data set, the contributions from the individual variables have comparable weights:

human_std <- scale(human)
pca_human_std <- prcomp(human_std)

s <- summary(pca_human_std)
pca_pr <- round(100*s$importance[2, ], digits = 3)
pc_lab <- paste0(names(pca_pr), ' (', pca_pr, '%)')

biplot(
    pca_human_std
  , choices = 1:2
  , cex=c(0.8,1)
  , col=c('grey40', 'deeppink2')
  , xlab = pc_lab[1]
  , ylab = pc_lab[2])
Biplot for PCA applied to the standardized data set. PC1 is connected to the gross national income, life expectancy at birth, expected years of education, secondary education ratio of the female and male population, maternal mortality ratio, and adolescent birth rate. PC2 is connected to the labour force participation ratio of the female and male population and the fraction of female members in parliament.

As could be seen, the updated figure is very different from the result obtained with the non-standardized data set. The first analysis was strongly influenced by the large values of the gni variable, which effectively hid the variability associated with the other variables. PCA applied to the standardized data set shows no such feature and has a clearer connection between the variables. The contributions of the original variables shown in the figure could be split into two almost orthogonal sets: one includes edu2Ratio, eduE, lifeE, gni, mmRatio, and adlBirthRate, while the second consists of labRatio and parliamentF. This division could be explained in the following way. The first principal component reflects the general development level of the countries, because it is influenced by variables such as the gross national income and life expectancy at birth, which are high in developed countries, and the maternal mortality ratio, which is expected to be high in less developed countries. The second principal component, which is connected to the fraction of female members in parliament and the labour force participation of the female population, reflects the involvement of the female population in economic and political life.
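These two variable groups can also be read directly from the PCA loadings; a sketch continuing from the pca_human_std object above:

```r
# Loadings of the original variables on the first two principal components;
# variables with large absolute values on the same column form one group
round(pca_human_std$rotation[, 1:2], 2)
```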

When a data set includes qualitative variables, multiple correspondence analysis (MCA) could be applied to investigate patterns in the data or reduce the dimensionality. In this exercise MCA is applied to the tea data set, which is provided as a part of the FactoMineR package:

data('tea')

tea <- tea %>% select(-age)

dim(tea)
## [1] 300  35
str(tea)
## 'data.frame':    300 obs. of  35 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

As could be seen from the summary information, this data set provides 300 individual observations of 35 variables, representing a survey on tea. The respondents provided information about how they drink tea, their perception of tea products, and personal details. All the variables in the data set are factor variables except for age, which is excluded from the analysis.
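That all remaining columns are indeed factors could be verified with a one-liner; a sketch assuming the tea data from FactoMineR is loaded and the age column dropped, as above:

```r
# TRUE when every column of the (age-free) tea data set is a factor
all(sapply(tea, is.factor))
```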

The general overview of the tea data set could be given by the following series of the bar plots:

gather(tea) %>% ggplot(aes(value)) +
  geom_bar() +
  theme_bw() +
  theme(
      panel.grid.major=element_blank()
    , panel.grid.minor=element_blank()
    , axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
  facet_wrap('key', scales = 'free')

This figure provides some interesting details about the tea drinking habits of the respondents. For example, most of them drink tea at home and do not add lemon or milk.

While it is possible to apply MCA to the whole data set, it would be difficult to interpret the result due to the large number of individual parameters. Therefore, in this exercise MCA is applied to a small selected subset of the tea variables:

keep_columns <- c('Tea', 'work', 'How', 'resto', 'where', 'price')
tea_sel <- select(tea, one_of(keep_columns))

mca <- MCA(tea_sel, graph = FALSE)

The summary of the MCA results could be provided with a usual biplot figure:

plot(mca, invisible=c('ind'), habillage = 'quali') 

The figure suggests that the x-axis is connected to the tea type and the place where a respondent prefers to drink tea, while the y-axis seems to be related to the place where a respondent buys their tea and its price.
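How much of the total variance the two plotted dimensions actually retain could be checked from the eigenvalues stored in the MCA result; a sketch continuing from the mca object above:

```r
# Eigenvalues and percentages of variance of the MCA dimensions;
# the first two rows correspond to the axes of the biplot
head(mca$eig)
```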

Adding the individual responses to the plot could reveal some patterns in the data set:

plot(mca, label='var', habillage = 'quali') 

According to the figure, green tea is not in great favor: the respondents prefer to drink black tea or Earl Grey. However, respondents who buy more expensive tea in tea shops show some tendency towards drinking green tea.


(more chapters to be added similarly as we proceed with the course!)